0. Preliminars 📖

Introduction

To impute the harmonized dataset, we followed the next steps:

  1. Data preprosessing.
  • Data exploration.
  • Data correction.
  • Creation of new variables.
  1. Imputation process.
  • Selection of variables in the imputation model.
  • Setting of imputation methods.
  • Post processing.
  • Setting of additional imputation parameters.
  • Convergence evaluation.
  • Sensitivity analysis.

Folder structure 📂

The following graph shows folder tree of our project.

drawing

In the “0_Documentation” subfolder are saved the current Rmarkdown code along with its html version. Inside the “1.Input_data” subfolder are placed the data resulting from the harmonization process in csv format and a xlsx file with the list of proposed variables to include in the imputation model.

The “2_Code” subfolder contains the original R code for the data processing, for imputation process (including the codes to sent to HPC computer and to merge the resulting iterations) and convergence analysis. Relevant output files and graphs of our job are saved on the “3_Output_data” and “4_Graphs” subfolders.

Program specification 💻

We run all codes in R version 4.1.0 on a macOS computer and on a linux based High performance computer.

Load packages and input information

Load the required R package

rm(list=ls()) # clean environment

# Data manipulation packages
library(here) 
library(data.table)
library(dplyr)

# Imputation packages
library(mitml)
library(mice)
library(micemd)
library(miceadds)

# Graphic packages
library(corrplot)
library(ggplot2)
library(ggtext)
library(plotly)

# Field specific package
library(growthstandards)

Load dataset and other source of information

By using the “here” package, it is possible to define relative paths in the project folder. Therefore the user only needs to update the input files in the respective subfolder and run the code without require any further modification.

data <- as.data.table(read.csv(here('1_Input_data','ALLDATASET_preliminary.csv')))
index <- as.data.table(readxl::read_xlsx(here('1_Input_data','Index.xlsx'),sheet="Sheet1")) 
#It specifies the variables to include according to Lauren and Anneke criteria.
nindex <- as.data.table(readxl::read_xlsx(here('1_Input_data','Index.xlsx'),sheet="Sheet2")) #It specifies the order of the variables included in the imputation model 
index <- index[Bkey%in%c(1,2)]  # Index of variables proposed by Lauren and Anneke.
vary <- c("studyname",index[,Variable]) # Get a vector with the proposed variables.
data <- as.data.table(data[,..vary]) # Subset the dataset with the proposed variables.

1. Data exploration 🔎

Initially we check the data set to identify possible typing errors and outliers. We separately check the tree group of input variables: Outcome, Exposure, Pregnant woman.

Redefine NA values

In order to apply the imputation methods on the incomplete database, we need to set all unspecified values (coded as 666, 777, 888, 999) to NA. R does not handle NA values in the same way as SAS, and specifically the mouse package only imputes variables with NA values. Before converting to NA values, we need to check if there is any NA specification that can be used to retrieve information from a particular variable. For example the 777 level of the arb_cling variable is used to further define the CZS variable.

data[, arb_clindiag := ifelse(arb_clindiag==777,6,arb_clindiag)] # We set the NA value ("777") to a level "6" to use it in the creation of CZS variable.

Then with the following instruction, we can replace all the NA values for all variables.

data[data == 666] <-  NA
data[data == 777] <-  NA
data[data == 888] <-  NA
data[data == 999] <-  NA
data[data == 9999] <-  NA

Check categorical variables

char <- which(sapply(data, is.character)) #wwhich variable are categorical
datac <- data[,..char]
# Transform observation with string format
data[, inf_head_circ_age_fu1 := as.numeric(sub(" .*", "",data$inf_head_circ_age_fu1))] # remove years at the end of the string
data[, zikv_elisa_ga_1 := as.numeric(sub(" .*", "",data$zikv_elisa_ga_1))] # remove weeks at the end of the string
data[ zikv_pcr_date_1 %in% c("22027","888",""), zikv_pcr_date_1 := NA] # set as NA values not given in date format
data[ zikv_elisa_date_1 %in% c(""),zikv_elisa_date_1 := NA] #set blank strings as NA 
data[ symp_date%in%c(""), symp_date:=NA]

Check outcome variables

#unique(data$othabnorm_spec) # 39 categories 
data[,othabnorm_spec := NULL] # We temporal remove this variable from the dataset as contains messy information: 39 categories in total
data[inf_weight < 100, inf_weight := inf_weight*100] #Correct 2 measures with a misplaced period
data[inf_weight > 10000, inf_weight := inf_weight/10]#Correct 1 measure with an extra zero 
data[,inf_sex := ifelse(inf_sex==3,NA,inf_sex)] # Assing 3 to NA, as this level is not defined
data[inf_length == 0,inf_length := NA] #Assign to NA all negative lengths
data[inf_head_circ_birth == 0,inf_head_circ_birth := NA] #Assign to NA all circumference with 0 value
data[,endga := ifelse(!is.na(endga),endga,birth_ga)] #end_ga with NA values can be replaced with observed  birth_ga values.
data[fet_us_micro_tri2 == 1|fet_us_micro_tri3 == 1, microcephaly_bin := 1] # Microcephaly binary can be defined from fet_us_micro_tri2 and fet_us_micro_tri3 values
data[loss_etiology == 4,loss_etiology := NA] # Level 4 not defined for loss_etiology

Check exposures

By checking the inclusion criteria of each study it was found that the following studies only included women with detected zika infection (Source: Lauren code).

  • Brazil_BahiaPaudaLima_Costa
  • Brazil_SP_RibeiraoPreto_Duarte
  • TrinidadTobago_Sohan

Therefore, one could assign the zika prevalence (zikv_prev = 1) for all women belonging to these studies. Note: According to protocol (Anneke suggestion), that the following studies applied also the same inclusion criteria.

  • Brazil_RiodeJaneiro_CunhaPrata
  • Spain_Soriano
  • FrenchGuiana_Pomar
  • Colombia_Mulkey

However by inspecting the data we found that on these countries there are a high proportion of women with a negative zika test (zikv_preg = 0 and in the following table defined as zikv_neg).

zik_incA<-c("Brazil_RiodeJaneiro_CunhaPrata","Brazil_SP_RibeiraoPreto_Duarte","Brazil_BahiaPaudaLima_Costa","Spain_Soriano","FrenchGuiana_Pomar","TrinidadTobago_Sohan","Colombia_Mulkey")

data[, .(count = .N,
                ziv_pos=sum(ifelse(zikv_preg==1,1,0),na.rm=TRUE),
                ziv_neg=sum(ifelse(zikv_preg==0,1,0),na.rm=TRUE),
                ziv_na=sum(ifelse(is.na(zikv_preg),1,0),na.rm=TRUE),
                inclusion=ifelse(studyname%in%zik_incA,1,0))
                , by = studyname] 
##                          studyname count ziv_pos ziv_neg ziv_na inclusion
##  1: Brazil_RiodeJaneiro_CunhaPrata    54       1       4     49         1
##  2: Brazil_SP_RibeiraoPreto_Duarte   513     513       0      0         1
##  3:    Brazil_BahiaPaudaLima_Costa    46      13      33      0         1
##  4:                  Spain_Soriano   278      27     206     45         1
##  5:             FrenchGuiana_Pomar   291     291       0      0         1
##  6:           TrinidadTobago_Sohan   104      83       3     18         1
##  7:       Brazil_RiodeJaneiro_Joao   219     219       0      0         0
##  8:                  Spain_Bardaji   198       4     194      0         0
##  9:                Colombia_Mulkey    85      77       0      8         1
## 10:                     USA_Mulkey    95      61      17     17         0

For the moment, we have chosen to apply only the inclusion criteria in the studies specified by Lauren.

zik_inc<-c("Brazil_RiodeJaneiro_CunhaPrata","Brazil_SP_RibeiraoPreto_Duarte","TrinidadTobago_Sohan")
data[studyname%in%zik_inc,zik_preg:=1]

Negative values of the following variables were assinged to NA

data[ miscarriage_ga<=0, miscarriage_ga:=NA]# codebook assigned to 666
data[ loss_ga<=0, loss_ga := NA]# codebook assigned to 666
data[ birth_ga<=0, birth_ga:=NA]# codebook assigned to 666       
data[ zikv_pcr_ga_1<=0,zikv_pcr_ga_1:=NA] # codebook assigned to 666
data[ zikv_elisa_ga_1<=0,zikv_elisa_ga_1:=NA] # codebook assigned to 666
data[ zikv_ga<=0,zikv_ga:=NA]# codebook assigned to 666
data[ symp_ga<=0,symp_ga:=NA]# codebook assigned to 666
data[ arb_clindiag_ga<=0,arb_clindiag_ga:=NA]# codebook assigned to 666

Observations without a defined date of zika test, but with a date for pcr or elisa test were assigned with the earlier date of the available test dates.

data[, zikv_ga_min:=apply(data[,c("zikv_elisa_ga_1","zikv_pcr_ga_1")], 1, min, na.rm = TRUE)]
data[, zikv_ga_min:=ifelse(is.infinite(zikv_ga_min),NA,zikv_ga_min)]
data[, zikv_gan:=ifelse(is.na(zikv_ga),zikv_ga_min,zikv_ga_min)]
data[, zikv_gan:=ifelse(is.na(zikv_ga),zikv_ga_min,zikv_ga_min)]
data[, zikv_ga_min:=NULL]
data[, zikv_trin:=ifelse(is.na(zikv_gan),zikv_tri,ifelse(zikv_gan<13,0,ifelse(zikv_gan<=27,1,2)))]

Check pregnant woman variables

data[age<14, age:=NA] #  ask "Spain_Soriano" for this age records, for the moment assigned to NA
data[tobacco==3,tobacco:=NA] # level 3 not defined for tobacco variable

2. Data correction

2.0. Microcephaly variable

@ Anneke could you please add the information related to this variable?

data[!is.na(inf_sex), hcircm2zscore := as.numeric(igb_hcircm2zscore(gagebrth = birth_ga*7,
                                                                    hcircm=inf_head_circ_birth,
                                                                    sex=ifelse(inf_sex== 0, "Male","Female")))]  
data[, microcephaly_temp := ifelse(hcircm2zscore<=-3,2,
                            ifelse(hcircm2zscore<=-2,1,
                            ifelse(hcircm2zscore<=2,0,
                            ifelse(!is.na(hcircm2zscore),3,NA))))]

data[ is.na(microcephaly), microcephaly := microcephaly_temp]
check_microcephaly <- as.data.table(table(mic_bin=data$microcephaly_bin,
                                          micro=data$microcephaly,
                                          micro_temp=data$microcephaly_temp,
                                          useNA = "always"))
check_microcephaly[N>0]  #check this inconsistency.. we priorirized microcephaly variable over microcephaly_bin 
##     mic_bin micro micro_temp   N
##  1:       0     0          0 491
##  2:       1     0          0  10
##  3:    <NA>     0          0  86
##  4:       1     1          0   3
##  5:       0     1          1   1
##  6:       1     1          1   8
##  7:    <NA>     1          1   2
##  8:       0     2          2   3
##  9:       1     2          2   9
## 10:    <NA>     2          2   1
## 11:       0     0          3  49
## 12:    <NA>     3          3  16
## 13:       0     0       <NA> 628
## 14:       1     0       <NA>   3
## 15:       0     1       <NA>   2
## 16:       1     1       <NA>  32
## 17:       0     2       <NA>   1
## 18:       1     2       <NA>  10
## 19:       0     3       <NA>   7
## 20:       0  <NA>       <NA> 162
## 21:       1  <NA>       <NA>   8
## 22:    <NA>  <NA>       <NA> 351
##     mic_bin micro micro_temp   N
data[,microcephaly_bin:=ifelse(microcephaly%in%c(0,3),0,ifelse(microcephaly%in%c(1,2),1,microcephaly_bin))]

2.1. CZN

For facilitate the construction of anormality variables we created the following function

#Function returns 1 if there is any anomaly detected and 0 if there were no anomaly detected across the recorded variables , NA no recorded variables.

checkcon<-function(data,col1){ #
  data$anyT <- rowSums(data[, .SD, .SDcols = col1], na.rm=T)
  data$allNA <- rowSums(!is.na(data[, .SD, .SDcols = col1]))
  data$Final<- ifelse(data$allNA==0,NA,ifelse(data$anyT>0,1,0))
  return(data$Final)
}

Define neuroimaging abnormalities

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==0,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==0,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==0,1,NA)] 
col1<-c("modificated1","modificated2","modificated3","hydrocephaly","calcifications","ventriculomegaly","fet_us_cns_tri2","fet_us_cns_tri3")
data[,neuroabnormality:=checkcon(data=data,col1=col1)]

Define congenital contractures

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==1,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==1,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==1,1,NA)] 
col1<-c("modificated1","modificated2","modificated3","fet_us_msk_tri2","fet_us_msk_tri3")
data[,contractures:=checkcon(data=data,col1=col1)]

Define cardio abnormalities

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==2,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==2,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==2,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_cardio_tri2","fet_us_cardio_tri3")
data[,cardioabnormality:=checkcon(data=data,col1=col1)]

Define gastrointestinal abnormalities

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==3,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==3,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==3,1,NA)] 
col1<-c("modificated1","modificated2","modificated3","fet_us_gastro_tri2","fet_us_gastro_tri3")
data[,gastroabnormality:=checkcon(data=data,col1=col1)]

Define orofacialintestinal abnormalities

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==4,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==4,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==4,1,NA)] 
col1<-c("modificated1","modificated2","modificated3","fet_us_orofac_tri2","fet_us_orofac_tri3")
data[,oroabnormality:=checkcon(data=data,col1=col1)]

Define ocular abnormalities

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==5,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==5,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==5,1,NA)] 
col1<-c("modificated1","modificated2","modificated3","fet_us_eyeear_tri2","fet_us_eyeear_tri3")
data[,ocularabnormality:=checkcon(data=data,col1=col1)]

Define genitourinaly abnormalities

data[,modificated1:=ifelse(fet_us_abn_spec_tri1==6,1,NA)] 
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==6,1,NA)] 
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==6,1,NA)] 
col1<-c("modificated1","modificated2","modificated3","fet_us_genur_tri2","fet_us_genur_tri3")
data[,genurabnormality:=checkcon(data=data,col1=col1)]

Define any congenital abnormality excluding microcephaly

col1<-c("neuroabnormality","contractures","cardioabnormality","gastroabnormality","oroabnormality","ocularabnormality","genurabnormality","othabnorm","fet_us_bin_tri1","fet_us_bin_tri2","fet_us_bin_tri3")
data[,anyabnormality_czs:=checkcon(data=data,col1=col1)]
col1<-c(col1,"anyabnormality_czs")

Notes @Anneke

  • anyabnormality, nonneurologic: I did not calculate it as is the combination of anyabnormality_czs and microcephaly so you can include the anyabnormality in the imputation
  • we need to check inclusion of neuronormality,ocularnormality, contractures and anyabnormalyty separaterly as all are combined in anyabnormality_czs

Create a czs variable according to WHO definition

WHO definition for CZS: Presence of confirmed maternal or fetal ZIKV infection AND presence of severe microcephaly AND presence of other malformations (including limb contractures, high muscle tone, eye abnormalities, and hearing loss, nose etc.)

data[, czs2 := ifelse((data$zikv_preg==1 | data$fet_zikv==1) & 
                    data$microcephaly==2 & data$anyabnormality_czs==1, 1,
                   ifelse( data$zikv_preg==0 & data$fet_zikv==0 & 
                            data$microcephaly!=2 & data$anyabnormality_czs==0,
                            0,NA))] 
data[, czsn := ifelse(is.na(czs),czs2,czs)]
data[, czs2 := NULL]

checktable <- data.table(table(czs=data$czs,czsn=data$czsn,useNA = "always"))
checktable[ N>0 ] #Combinations where the czv calculated differs from the given one. czsn gave information of around 100 observations 
##     czs czsn   N
## 1:    0    0 790
## 2: <NA>    0  93
## 3:    1    1  40
## 4: <NA>    1   3
## 5: <NA> <NA> 957

3. Create additional variables

3.0. Bdeath and Bdeath_ga

The original dataset contains variables that specify whether the pregnancy stage ends with the birth or death of the baby. When the baby dies the event is classified as miscarriage, loss depending on the time at which the pregnancy ends. In addition, related variables such as pregnancy termination, loss etiology or induced abortion are specified. Many of these variables have a high proportion of missing values which can be avoided by combining them into a time variable and an overall event indicator. In other words, instead of having separate indicator variables for death, miscarriage and lost and time variables associated with each of them, one can have an indicator variable, “bdeath”, that specifies whether the baby was born (bdeath=0) or died (bdeath=1), with a gestational time variable that indicates when the event occurred (bdeath_ga).

We start by combining the information of the variables miscarriage, loss, loss_etiology, birth.

data[,birth:=ifelse(!is.na(birth_ga),1,NA)] #birth indicator from bithga
data[,birth2:=ifelse(!is.na(inf_term),1,NA)] #birth indicator from inf_term
data[,bdeath:=ifelse(birth==1|birth2==1,0,NA)]
data[,bdeath_ga:=ifelse(!is.na(endga),endga,ifelse(!is.na(loss_ga),loss_ga,ifelse(!is.na(miscarriage_ga),miscarriage_ga,NA)))]

Then we used the etiology variable to add information on NA observations or ratify information in other pregancy term variables

The following data.table instruction allows us to assign the values on misscarriage, loss, loss_etiology, birth and bdeath in one step given a loss etiology condition.

#Etiology levels (0=live, 1=miscarriage, 2=loss, 3=imp death)

data[loss_etiology==0&(is.na(bdeath)|bdeath==0),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)] #birth
data[loss_etiology==1&(is.na(bdeath)|bdeath==1),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(1,0,1,0,1)] #miscarriage
data[loss_etiology==2&(is.na(bdeath)|bdeath==1),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)] #loss
data[loss_etiology==3&(is.na(bdeath)|bdeath==1),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)] #class as loss

Here indicators of bdeath are ratified with information of bdeath_ga

data[bdeath==1&bdeath_ga<20,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(1,0,1,0,1)]
data[bdeath==1&bdeath_ga>=20,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)]

Observations with loss =1 and birth=1, allows us to specify other variables

data[loss==1&is.na(loss_etiology),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)]
data[birth==1&is.na(loss_etiology),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]

Then we used inf_vital status to ratify information or correct contradictory observations

data[miscarriage==0&loss==0&is.na(birth)&inf_vital_status==0,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
data[birth==1&loss_etiology==1&inf_vital_status==0,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
data[birth==0&inf_vital_status==0,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]

Finally we use the inf_term and induceabort to correct or compleate some observations

data[!is.na(data$inf_term),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
data[inducedabort==1&birth==1,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)]

We combine the information of all the pregnancy term variables and check if there are observations with inconsistente relationships and remove temporal variables not need it in the final dataset.

#Coherent with inf_alive_birth 0=alive, induce abort 0=No
checktable<-data.table(table(miscarriage=data$miscarriage,
                             loss=data$loss,
                             etiology=data$loss_etiology,
                             birth=data$birth,
                             bdeath=data$bdeath,
                             vstatus=data$inf_vital_status,
                             abort=data$inducedabort,
                             infterm=!is.na(data$inf_term), useNA="always")) #V1 is miscarriage, V2 is loss and N the number of observations
checktable[N!=0,]
##     miscarriage loss etiology birth bdeath vstatus abort infterm    N
##  1:           0    0        0     1      0       0     0   FALSE   46
##  2:           0    0        0     1      0    <NA>     0   FALSE  217
##  3:           1    0        1     0      1    <NA>     0   FALSE   36
##  4:           0    1        2     0      1    <NA>     0   FALSE   11
##  5:           0    0     <NA>  <NA>   <NA>    <NA>     0   FALSE    8
##  6:           0 <NA>     <NA>  <NA>   <NA>    <NA>     0   FALSE    6
##  7:           1    0        1     0      1    <NA>     1   FALSE    2
##  8:           0    1        2     0      1    <NA>     1   FALSE   14
##  9:           0    0        0     1      0    <NA>  <NA>   FALSE    3
## 10:           1    0        1     0      1    <NA>  <NA>   FALSE    1
## 11:           0    1        2     0      1    <NA>  <NA>   FALSE   17
## 12:        <NA> <NA>     <NA>  <NA>   <NA>    <NA>  <NA>   FALSE   58
## 13:           0    0        0     1      0       0     0    TRUE 1311
## 14:           0    0        0     1      0       1     0    TRUE    4
## 15:           0    0        0     1      0       2     0    TRUE    3
## 16:           0    0        0     1      0    <NA>     0    TRUE   91
## 17:           0    1        2     0      1    <NA>     1    TRUE    3
## 18:           0    0        0     1      0    <NA>  <NA>    TRUE   52
data[,birth:=NULL]
data[,birth2:=NULL]

3.1. Maternal prescription drug use

col1<-c("med_bin","med_anticonvuls_bin","med_preg_bin","med_fertil_bin")
data[,drugs_prescr:=checkcon(data=data,col1=col1)]

3.2. Maternal vaccination

col1<-c("vac_rub_enroll","vac_vari_enroll","vac_yf_enroll")
data[,vaccination:=checkcon(data=data,col1=col1)]

3.3. Intrauterine exposure to storch pathogens

data[,modificated1:=ifelse(storch==0,0,ifelse(!is.na(storch),1,0))]
col1<-c("modificated1","storch_bin","toxo","toxo_treat","syphilis","syphilis_treat","varicella","parvo","rubella","cmv","herpes","listeria","chlamydia","gonorrhea","genitalwarts")
data[,storch_patho:=checkcon(data=data,col1=col1)]

3.4.Concurrent or prior flavi- or alpha virus infection

data[,modificated1:=ifelse(arb_clindiag_plus==0,0,ifelse(!is.na(arb_clindiag_plus),1,0))]
data[,modificated2:=ifelse(arb_clindiag!=0&arb_clindiag!=1,0,ifelse(!is.na(arb_clindiag),1,0))] #arb_clindiag==777 to 6 on top
col1<-c("modificated1","modificated2","denv_ever","chikv_ever")
data[,flavi_alpha_virus:=checkcon(data=data,col1=col1)]

3.5.Prior arb virus infection

data[,modificated1:=ifelse(zikv_pcr_everpos==1,1,ifelse(zikv_pcr_everpos==0,0,NA))] # 2 indeterminated
col1<-c("modificated1","zikv_elisa_everpos","denv_ever","chikv_ever")
data[,arb_ever:=checkcon(data=data,col1=col1)]

3.6.Current arb virus infection

data[,modificated1:=ifelse(zikv_pcr_res_1==1,1,ifelse(zikv_pcr_res_1==0,0,NA))] # 2 indeterminated
data[,modificated2:=ifelse(zikv_elisa_res_1==1,1,ifelse(zikv_elisa_res_1==0,0,NA))]
data[,modificated3:=ifelse(arb_clindiag==0,0,ifelse(!is.na(arb_clindiag),1,NA))] #arb_clindiag==777 to 6 modified at the beginning= other arbovirus
col1<-c("modificated1","modificated2","modificated3","zikv_preg","denv_preg","chikv_preg")
data[,arb_preg:=checkcon(data=data,col1=col1)]

3.7. Arb virus infection on current pregnancy without consider zika

data[,modificated1:=ifelse(arb_clindiag==0|arb_clindiag==1,0,ifelse(!is.na(arb_clindiag),1,NA))] #arb_clindiag==777 to 6 modified at the beginning= other arbovirus
col1<-c("modificated1","denv_preg","chikv_preg")
data[,arb_preg_nz:=checkcon(data=data,col1=col1)]

4. Selection of variables in imputation model

4.1. Influx- Outflux plot

Van Buuren proposed the influx and outflux values as selection criteria of predictor variables. The influx is a measure of how the missing observations of a variable is connected with observed data of other variables. Whereas the outflux value is provide an estimation of how the observed data of a variable is connected with incomplete data of the other variables. The outflux parameter is an indicator of potential power of a variable for imputing other variables. In this case we define that a variable could be a good predictor if it outflux value is equal or superior to 50% .i.e. could help to predict more that the 50% of variables with incomplete data.

fx<-flux(data)
fluxplot(data,eqscplot=TRUE)

outlist<-row.names(fx)[fx$outflux>=0.5]
sort(outlist)
##  [1] "age"                 "anyabnormality_czs"  "arb_ever"           
##  [4] "arb_preg"            "arb_symp"            "bdeath"             
##  [7] "bdeath_ga"           "birth_ga"            "conjunctivitis"     
## [10] "endga"               "fever"               "gravidity"          
## [13] "hiv"                 "hydrocephaly"        "inducedabort"       
## [16] "inf_head_circ_birth" "inf_sex"             "inf_term"           
## [19] "inf_vital_status"    "inf_weight"          "loss"               
## [22] "loss_etiology"       "microcephaly"        "microcephaly_bin"   
## [25] "mid_original"        "miscarriage"         "multiplegest"       
## [28] "neuroabnormality"    "parity"              "rash"               
## [31] "storch"              "storch_bin"          "storch_patho"       
## [34] "studyname"           "symp_tri"            "ventriculomegaly"   
## [37] "zikv_gan"            "zikv_pcr_date_1"     "zikv_pcr_everpos"   
## [40] "zikv_pcr_ga_1"       "zikv_pcr_res_1"      "zikv_pcr_tri_1"     
## [43] "zikv_preg"           "zikv_trin"

4.2. Percentage of missingness for each variable across studies

We calculate the proportion of missingess of each varable on each study and place it in a matrix format that allow us to plotit.

nindex1 <- nindex[order(order),][!is.na(order)] # order according categories:  pregnancy women, exposure, outcome
pldata<-data[, .SD, .SDcols = nindex1$names]
dmatrix<-pldata[, lapply(.SD, function(x) sum(is.na(x))/.N), studyname] #calulate % missingness
dmatrix2<-melt(dmatrix,id.vars="studyname") # change study name to display
dmatrix2[,name:=fcase(
  studyname=="Brazil_RiodeJaneiro_CunhaPrata","Brazil\nCunhaPrata",
  studyname=="Brazil_SP_RibeiraoPreto_Duarte","Brazil\n Duarte",
  studyname=="Brazil_BahiaPaudaLima_Costa","Brazil\nCosta",
  studyname=="Spain_Soriano" ,"Spain\nSoriano",
  studyname=="FrenchGuiana_Pomar" ,"Fr.Guiana\nPomar",
  studyname=="TrinidadTobago_Sohan" ,"Tri.Tobago\nSohan",
  studyname=="Brazil_RiodeJaneiro_Joao" ,"Brazil\nJoao",
  studyname=="Spain_Bardaji" ,"Spain\nBardaji",
  studyname=="Colombia_Mulkey"  ,"Colombia\nMulkey",
  studyname=="USA_Mulkey"  ,"USA\nMulkey"
)]

dmatrix2[,miss:=round(value*100,1)]
dmatrix2[,text:=paste0("study: ", studyname, "\n", "variable: ", variable, "\n", "miss%: ",miss)] # set the text displayed by place the cursor over especific observation.

The ploty package allows user to zoom out sections of the plot, by placing the cursor over an observation is diplayed the % missing by study.

names<-nindex1$names
color<-ifelse(nindex1$Mod_var == 1, "red", "black")
nindex1$color<-color

p<-ggplot(dmatrix2, aes(x=name, y=variable,fill=miss,text=text)) +
  geom_tile() +
  scale_fill_gradientn(colours=c("green","yellow","red")) +
  theme(axis.text.x = element_text(angle = 45,vjust=0.5,size=8),
        axis.text.y = element_text(size=8))+xlab("Study name")+ylab("Variable")+
  labs(fill='%missing') 

ggplotly(p, tooltip="text")

5. Imputation methods

We subset the dataset with the selected variables.

nindexf<-nindex[Final==1,]
fdata<-data[, .SD, .SDcols = nindexf$names]
summary(fdata)
##   inducedabort         bdeath          bdeath_ga      hydrocephaly   
##  Min.   :0.00000   Min.   :0.00000   Min.   : 7.00   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:38.00   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :39.00   Median :0.0000  
##  Mean   :0.01084   Mean   :0.04638   Mean   :38.13   Mean   :0.0035  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:40.00   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :47.30   Max.   :1.0000  
##  NA's   :131       NA's   :72        NA's   :656     NA's   :741     
##  corticalatrophy  calcifications   ventriculomegaly   othabnorm     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.0195   Mean   :0.0229   Mean   :0.0413   Mean   :0.0704  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :1010     NA's   :1011     NA's   :745      NA's   :1215    
##   microcephaly    neuroabnormality anyabnormality_czs      czsn       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000     1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000     Median :0.0000  
##  Mean   :0.1211   Mean   :0.1088   Mean   :0.1616     Mean   :0.0464  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000     3rd Qu.:0.0000  
##  Max.   :3.0000   Max.   :1.0000   Max.   :1.0000     Max.   :1.0000  
##  NA's   :521      NA's   :679      NA's   :621        NA's   :957     
##  igr_curr_preg     multiplegest       inf_sex         inf_weight  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   : 100  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2868  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :3175  
##  Mean   :0.0447   Mean   :0.0202   Mean   :0.4745   Mean   :3138  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3500  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :7000  
##  NA's   :1123     NA's   :299      NA's   :686      NA's   :500   
##    inf_length    inf_head_circ_birth   zikv_preg         zikv_gan     
##  Min.   : 4.00   Min.   :15.00       Min.   :0.0000   Min.   :  2.00  
##  1st Qu.:47.50   1st Qu.:33.00       1st Qu.:0.0000   1st Qu.: 15.14  
##  Median :49.00   Median :34.00       Median :1.0000   Median : 23.43  
##  Mean   :48.58   Mean   :33.76       Mean   :0.7383   Mean   : 23.24  
##  3rd Qu.:50.00   3rd Qu.:35.00       3rd Qu.:1.0000   3rd Qu.: 31.00  
##  Max.   :54.00   Max.   :47.50       Max.   :1.0000   Max.   :111.14  
##  NA's   :956     NA's   :569         NA's   :137      NA's   :850     
##      fever             rash        conjunctivitis      arb_symp     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.2336   Mean   :0.4934   Mean   :0.1616   Mean   :0.5791  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :329      NA's   :286      NA's   :336      NA's   :403     
##     arb_ever       arb_preg_nz          age          gravidity     
##  Min.   :0.0000   Min.   :0.0000   Min.   :14.00   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:23.00   1st Qu.: 1.000  
##  Median :1.0000   Median :0.0000   Median :28.00   Median : 2.000  
##  Mean   :0.8925   Mean   :0.0996   Mean   :27.74   Mean   : 2.528  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:32.00   3rd Qu.: 3.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :49.00   Max.   :14.000  
##  NA's   :69       NA's   :869      NA's   :546     NA's   :665     
##      parity            hiv          storch_patho     studyname        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Length:1883       
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median : 1.000   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   : 1.673   Mean   :0.1728   Mean   :0.4404                     
##  3rd Qu.: 2.000   3rd Qu.:0.0000   3rd Qu.:1.0000                     
##  Max.   :12.000   Max.   :1.0000   Max.   :1.0000                     
##  NA's   :681      NA's   :581      NA's   :675

Then we specify vector for each type of variable

ncon<-c("age","bdeath_ga","zikv_gan","inf_weight","inf_head_circ_birth","inf_length") # name of continuous variables
nncon<-colnames(data)[!colnames(data)%in%ncon] # name of no continuous variables
ncat<-c("gravidity","parity","microcephaly") #name of categorical variables
nbin<-nncon[!nncon%in%ncat] #name of binary variables

We initialize a mice object to later on set the methods, prediction and post arguments

ini <- mice(data, maxit = 0)
## Warning: Number of logged events: 42
meth <- ini$meth
pred <- ini$pred
post <- ini$post

In general we assigned clusterized methods for all variables, however we set marginalized methods for those variables which their distribution does not vary across studies or that result in convergency problems.

meth[ncon] <- "2l.norm" # normal distribution at study level
meth[c("inf_weight","inf_head_circ_birth","inf_length","zikv_gan")]<-"norm" # Due to convergence problems imputation based on marginal distribution
meth[ncat] <- "2l.pmm" # predictive mean matching at study level
meth[c("na_b")] <- "2l.pmm" # set with ppm at study level 
meth[nbin] <- "2l.bin"#  binomial distribution at study level
meth[c("corticalatrophy","ventriculomegaly","multiplegest","othabnorm","calcifications")] <- "logreg" # Due to frequent singular fit warning imputation based on marginal distribution

To construct the prediction matrix, we used as predictors all variables that have a Pearson correlation with the imputed variable of at least 0.3. In addition, we specify the studyname as a cluster indicator.

pred <- quickpred(data, mincor=0.3) # assignation based on pairwise correlaition
pred[, c("bdeath_ga")] <- 0 # as information of prediction is given by nelson allen variable na_b
pred[,"studyname"] <- -2 # define the cluster for imputation models at study level.
pred["studyname", "studyname"] <- 0

For all continuous variables, we added a post-processing step that censored all imputed values outside the observable range of each variable.

post["age"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(14, 50))"
post["bdeath_ga"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, 50))"
post["zikv_gan"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, 65))"
post["inf_weight"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(100, 7000))"
post["inf_head_circ_birth"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(15, 50))"
post["inf_length"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(18, 60))"

Finally we created the mice object, running 20 iterations for each 10 imputations.

 micesurv <- mice(data, predictorMatrix = pred,method=meth,post=post, maxit =20, m = 10)

Due to the exhaustive processing time, we run the simulations in a high performance computer, all the codes are save it in the subfolder codes.